
Conversation

@aditya0by0 (Member) commented Oct 8, 2025

This PR addresses a memory accumulation issue observed during training with the ResGatedDynamicGNI model when persistent_workers=True is enabled in the DataLoader.

⚙️ Existing Problem

  • The ResGatedDynamicGNI model performs per-forward random feature initialization for both node and edge features (new_x, new_edge_attr) on the GPU (see the sketch after this list).

  • When combined with persistent DataLoader workers, these per-batch random allocations are not released properly because:

    • Worker processes remain alive across epochs.
    • CUDA’s caching allocator retains fragmented memory blocks.
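
For context, here is a minimal sketch of the per-forward initialization pattern described above. The class name and tensor shapes are illustrative assumptions, not the project's actual implementation; the point is only the behaviour: fresh random CUDA allocations on every forward call, never reused across batches or epochs.

```python
import torch
import torch.nn as nn


class DynamicNoiseInit(nn.Module):
    # Illustrative sketch only (hypothetical class, not the project's code):
    # fresh random node/edge features are drawn on the model's device in every
    # forward pass, so each batch produces new allocations that are never reused.
    def __init__(self, node_dim: int, edge_dim: int):
        super().__init__()
        self.node_dim = node_dim
        self.edge_dim = edge_dim

    def forward(self, x: torch.Tensor, edge_attr: torch.Tensor):
        # Random features replace the incoming ones on every call.
        new_x = torch.randn(x.size(0), self.node_dim, device=x.device)
        new_edge_attr = torch.randn(
            edge_attr.size(0), self.edge_dim, device=edge_attr.device
        )
        return new_x, new_edge_attr
```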

While setting persistent_workers=True can improve performance when input features remain constant throughout training (as noted in the Lightning documentation), it becomes problematic when input features are dynamically initialized in each forward pass. In such cases, the DataLoader workers retain these transient tensors in memory, expecting reuse across epochs. Since they are never reused, this leads to progressive GPU memory accumulation and can eventually cause out-of-memory (OOM) errors. See the related issue and logs here.

Refer: https://lightning.ai/docs/pytorch/stable/advanced/speed.html#persistent-workers

🧠 Root Cause

persistent_workers=True keeps worker subprocesses alive between epochs, so the CUDA contexts and cached allocations behind the tensors that the ResGatedDynamicGNI model reinitializes in each forward pass are retained rather than released.

🔧 Fix Implemented

  • Exposed persistent_workers as a configurable DataLoader option via the CLI, so it can be set to False for ResGatedDynamicGNI model training (see the sketch after this list).
    This ensures that:

    • Workers are restarted cleanly each epoch.
    • GPU and CPU memory are fully released after each epoch.
    • Memory fragmentation and accumulation are avoided.
  • The default remains True, as before, so no existing pipelines are disrupted.
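
Below is a minimal sketch of the kind of DataLoader wiring this option controls. The helper name build_dataloader and its signature are assumptions for illustration (the real option is plumbed through the project's training CLI); the persistent_workers semantics are those of torch.utils.data.DataLoader.

```python
from torch.utils.data import DataLoader, Dataset


def build_dataloader(dataset: Dataset, batch_size: int, num_workers: int,
                     persistent_workers: bool = True) -> DataLoader:
    # Hypothetical helper: the default stays True to preserve the previous
    # behaviour; pass False (via the CLI option) for models such as
    # ResGatedDynamicGNI that re-initialize their input features on every
    # forward pass.
    return DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        # DataLoader rejects persistent_workers=True when num_workers == 0,
        # so only enable it when worker processes actually exist.
        persistent_workers=persistent_workers and num_workers > 0,
    )
```

With persistent_workers=False, worker processes are shut down at the end of each epoch and recreated for the next one, trading a small per-epoch startup cost for a clean release of worker-held memory.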

@aditya0by0 aditya0by0 requested a review from sfluegel05 October 8, 2025 16:14
@aditya0by0 aditya0by0 self-assigned this Oct 8, 2025
@sfluegel05 (Collaborator) commented:

I don't see how the run you linked refers to any out-of-memory issues. The GPU memory allocation is at a constant 7.2% for the whole run.

Aside from that, having this option won't hurt, so I am merging this.

@sfluegel05 sfluegel05 merged commit d52b422 into dev Oct 14, 2025
5 checks passed
@sfluegel05 sfluegel05 deleted the fix/persistent_workers branch October 14, 2025 10:56
@aditya0by0 (Member, Author) commented Oct 14, 2025
